...SPECIFICATION can be based on some measure of similarity between the
instructions, for example, recent branch history, stalls, instruction
types, or other recent state history.

Pinpointing performance bottlenecks in out-of-order processors requires 
detailed information about both stall times and concurrency levels.
In contrast to in-order processors, a long-latency instruction is not 
problematic when there is sufficient concurrency to efficiently utilize 
the processor while the long-latency instruction is stalled. 

One approach for obtaining concurrency information is to snapshot the 
entire pipeline state. That will directly reveal where sets ... compute the 
average concurrency level when a memory access transaction "hits" in one 
of the caches, and then to compare the average concurrency level with
the case where a cache miss is incurred. Other interesting aspects to 
examine for correlation with varying concurrency levels include register 
dependent stalls, cache miss stalls, branch-misprediction stalls, and 
recent branch history.
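
The hit-versus-miss comparison described above can be sketched as follows. This is a minimal illustration, not the patent's profiling hardware: the record layout, the function name average_concurrency, and the sample values are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical sample records: (cache_hit, concurrency_level), where
# concurrency_level stands for the number of instructions in flight when
# the memory access was sampled. The values are invented for illustration.
samples = [
    (True, 6), (True, 5), (False, 2), (True, 7), (False, 1), (False, 3),
]


def average_concurrency(records):
    """Return (mean concurrency on hits, mean concurrency on misses)."""
    hits = [level for hit, level in records if hit]
    misses = [level for hit, level in records if not hit]
    return mean(hits), mean(misses)


hit_avg, miss_avg = average_concurrency(samples)
print(f"hits: {hit_avg:.2f}, misses: {miss_avg:.2f}")
```

A large gap between the two averages would suggest that misses coincide with low concurrency, which is the case the excerpt identifies as a likely bottleneck.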

An additional benefit of profiling a cluster of instructions is the
ability to obtain path... 
...SPECIFICATION data, it is not necessary to go back and review the last
data entry to predict the set for the next data entry if this
information is stored with the instruction ... earlier. The cache 128 is
physically direct-mapped, but logically divided into four sets. Each
cache line includes a data portion 114, a tag portion 112, and level-one
set predictor...

...set predictor 116. LRU information 150 is also included. The data (or 
instruction) from the cache is provided to the execution unit of the 
microprocessor on an I/O bus 118... 

...register 132, and the L1 set predictor from the last access in latch
120. The L1 set predictor for the selected cache entry is then
provided to latch 120, to override the previous entry and be ready for
the next cache access. At the same time, the level-two set
predictor is provided to a latch 126.

A tag comparator 130 compares the actual input address from register 
132 on bus 134 to the tag from the cache. This is used to first
determine if the appropriate set was properly predicted. If it... 

...correctly, a miss signal on line 121 is provided to prefetch unit 122. 
If the cache entry was an instruction, the instruction already loaded 
into the instruction buffer is invalidated. Where the cache entry was 
data, a miss signal is used to invalidate a data register where the 
cache contents were data loaded into a data register. 

Comparator 130 then compares the address to the tags for the other
logical sets in the cache. This can be done with 3 more accesses (or
less, if there is a hit...
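
The probe order described in these excerpts (predicted set first, then the remaining logical sets) can be sketched roughly as below. The data layout and the names lookup, cache, and predicted_set are hypothetical; predictor updates, LRU handling, and the invalidation signals are left out.

```python
NUM_SETS = 4  # the excerpt describes four logical sets


def lookup(cache, index, tag, predicted_set):
    """Probe the predicted set first, then the remaining sets.

    cache[s] is a dict mapping a line index to a (tag, data) tuple.
    Returns (data, actual_set, accesses); data is None on a true miss.
    """
    order = [predicted_set] + [s for s in range(NUM_SETS) if s != predicted_set]
    for accesses, s in enumerate(order, start=1):
        entry = cache[s].get(index)
        if entry is not None and entry[0] == tag:
            return entry[1], s, accesses
    return None, None, NUM_SETS


# Usage: one line stored in logical set 2.
cache = [dict() for _ in range(NUM_SETS)]
cache[2][5] = (0xAB, "payload")
print(lookup(cache, 5, 0xAB, predicted_set=2))  # correct prediction
print(lookup(cache, 5, 0xAB, predicted_set=0))  # misprediction, extra probes
```

A correct prediction costs a single access, while a misprediction costs up to three more, which matches the access count given in the text.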



this cache line, the SP for the previous address can be written at the
same time. This is possible because there is a separate addressing
input for the SP portion of the L1 cache, which is addressed by an SP
address in a register 137 as shown in Fig. 8.

Alternately, the correction for the SP bits for either the L1 SP or
L2 SP could be provided to a write-back buffer in prefetch unit 122.
There, it is stored along with the associated address from previous
address register 131, for later writing back to the L1 cache upon an
empty cycle becoming available. In addition, in a preferred embodiment,
the set prediction...

...thereof. For example, the set-prediction information could be stored for
each line of a cache, or alternately for each entry or for a group of
lines. In an alternate embodiment, the invention could be applied to a
3rd level or Nth level cache. In addition, the present invention can be
applied to other data structures which are not specifically labeled as
caches. The method set forth above may be varied for different
embodiments. For instance, the branch separately for the L1 and L2
caches, allowing one to be written while the other is not, or vice
versa.
...SPECIFICATION policy known in the art. The larger L2 cache 203 holds
more data than L1 cache 202 and ordinarily controls the memory
coherency protocol. In the present invention, the data in... stream
buffer.

When PCC 404 fetches data, if it is in L1 cache 202 (an L1 hit), it
is sent to PCC 404. If it is not in L1 cache 202 (an L1 miss), but it
is in L2 cache 203 (an L2 hit), a line of L1 cache 202 is replaced
with this subject data from L2 cache 203. In this case, the data is
sent simultaneously to L1 cache 202 and PCC 404. If there is a miss
in L2 cache 203 as well, the data may be fetched from memory 209 into
BIU 401 and loaded simultaneously into L1 cache 202, L2 cache 203,
and PCC 404. Variations on this operation are known in the art. Data
store operations are similar to the fetch operations except that the
data is stored into an L1 line to complete the operation...
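
The three fetch outcomes in the paragraph above can be sketched as a small function. This is an illustration under simplifying assumptions: the caches are plain dictionaries with no capacity limit, so line replacement is not modeled, and the name fetch and its return convention are hypothetical.

```python
def fetch(addr, l1, l2, memory):
    """Return (data, source) and fill the caches as the excerpt describes."""
    if addr in l1:                 # L1 hit: data goes straight to the requester
        return l1[addr], "L1"
    if addr in l2:                 # L1 miss, L2 hit: fill L1 from L2
        l1[addr] = l2[addr]
        return l1[addr], "L2"
    data = memory[addr]            # miss in both: load L1 and L2 together
    l1[addr] = data
    l2[addr] = data
    return data, "memory"


# Usage: the first access goes to memory, the second hits in L1.
l1, l2, memory = {}, {}, {0x10: "a"}
print(fetch(0x10, l1, l2, memory))
print(fetch(0x10, l1, l2, memory))
```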

...miss. The filter contains a number of locations that can hold such
addresses comprising a "history" of such events. They may be replaced
on a least recently used (LRU) basis. Whenever...
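
A history filter with LRU replacement could be modeled along these lines. Only the storage and replacement behavior stated in the excerpt is covered; what the filter does with a match is elided in the source and is not modeled. The class name MissFilter and its interface are assumptions.

```python
from collections import OrderedDict


class MissFilter:
    """Fixed-size history of miss addresses with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first

    def record(self, addr):
        """Note a miss address; return True if it was already in the history."""
        seen = addr in self.entries
        if seen:
            self.entries.move_to_end(addr)        # mark as most recently used
        else:
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[addr] = True
        return seen


# Usage: a two-entry filter.
history = MissFilter(capacity=2)
for address in (1, 2, 1, 3, 2):
    print(address, history.record(address))
```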

...SPECIFICATION R bits on DLAT hits/misses. The present invention is
independent of the second level cache (L2) hierarchy and is only 
concerned about the concurrency of data lines in a first... 

...Patent 4,181,937 to Hattori et al., an MP cache replacement scheme in a
two level cache hierarchy is taught. Upon the decision of replacement
of a block from L2 shared by all processors, blocks with fewer numbers
of copies in the first level processor caches are given higher
preference. This is supposed to increase concurrency at L1 with
better L2 replacement strategies. The present invention is not
concerned with L2 replacements.
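
The replacement preference attributed to Hattori et al. can be illustrated with a short selection function. This is only a sketch of the stated preference (fewest first-level copies goes first); the function name, the dictionary of copy counts, and the tie-breaking rule are assumptions, not details from the cited patent.

```python
def choose_victim(candidates, l1_copy_count):
    """Pick the L2 block with the fewest copies in first-level caches.

    candidates: block identifiers eligible for replacement.
    l1_copy_count: hypothetical map from block id to the number of L1
    caches holding a copy; blocks missing from the map count as 0.
    Ties go to the earliest candidate, an arbitrary choice for this sketch.
    """
    return min(candidates, key=lambda block: l1_copy_count.get(block, 0))


print(choose_victim(["A", "B", "C"], {"A": 2, "B": 0, "C": 1}))
```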

In U.S. Patent 4,503,497 to Krygowski et al., a cache to...

...EX (not CH) only when it is found CH'd in the remote cache. The cache 
to cache transfer environment is not discussed in Flusche et al. The 
present invention provides a capability...

...do this kind of EX (but also CH) fetch upon remote CH situations for a 
cache to cache transfer environment. This has several advantages since 
when using a cache to cache transfer facility the CH line may be
transferred to another cache as EX and CH ... restriction on concurrency of 
a cache line due to its being modified as a remote past event, and 
allows cache lines to be read only and be available to all processors... 

...in the present invention is achieved by allowing a line in different 
caches in a read only state when a damaging sharing characteristic 
disappears. 

Detailed Description 

storage entry is not occupied by a valid entry, an L2 predictor
storage 260 is queried for a branch prediction entry corresponding to
the fetch address (block 408). In one embodiment, L1 predictor storage
206 and L2 predictor storage 260 may be queried in parallel. If no
corresponding entry is present in the L2 predictor storage 260 (block
410), a new branch prediction entry may be created in the L1
predictor storage 206 for the presented fetch address (block 412). On
the other hand, if there exists an entry in the L2 branch predictor
storage 260 corresponding to the fetch address, data from the L2 entry
is utilized to rebuild a full branch prediction corresponding to the
fetch address (block 414). The rebuilt branch prediction is then stored
in the L1 branch predictor...
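
The decision flow of blocks 408 through 414 can be sketched as below. The dictionaries standing in for the two predictor storages and the rebuild and new_entry callbacks are hypothetical; how a full prediction is actually rebuilt from the L2 data is not given in the excerpt.

```python
def fill_l1_entry(fetch_addr, l1, l2, rebuild, new_entry):
    """Populate l1[fetch_addr] after an L1 predictor miss; return the entry."""
    if fetch_addr in l2:                              # blocks 408/410: L2 has it
        entry = rebuild(fetch_addr, l2[fetch_addr])   # block 414: rebuild
    else:
        entry = new_entry(fetch_addr)                 # block 412: create fresh
    l1[fetch_addr] = entry
    return entry


# Usage with placeholder callbacks that only label where the entry came from.
l1_pred, l2_pred = {}, {0x100: "partial"}
print(fill_l1_entry(0x100, l1_pred, l2_pred,
                    rebuild=lambda addr, part: ("rebuilt", part),
                    new_entry=lambda addr: ("new", addr)))
```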

...recovered from an L2 branch predictor storage, rather than having to 
rebuild it through a history of branch executions. Further, only a 
subset of information corresponding ...one embodiment, local predictor 
storage 206 may be organized in the same manner as instruction cache 
16. Data stored in local predictor storage 206 may consist of lines of 
storage organized...

...local predictor storage 206 is of sufficient size to cover all entries 
in the instruction cache 16. In an alternative embodiment, local 
predictor storage 206 may be smaller than instruction cache 16. For
example, local predictor storage 206 may be 1/4 the size of instruction
cache 16. In such an embodiment, additional bits may be stored along
with a local prediction...
Fulltext Word Count: 7483

Fulltext Availability: 
Claims 

Claim 

determined to be a 
branch instruction. 

...The method of claim 15, wherein the steps of searching the first
level BPT and searching the second level BPT occur simultaneously.

19. The method of claim 15, wherein the step of searching the first
level BPT includes the step of comparing an address tag of the IP
address to an address tag stored in the first level BPT, and the step
of searching the second level BPT includes the step of selecting an
entry from a direct-mapped table.

20. A method...

...of:

predicting the subsequent IP address to be a target address from a
target address cache if a branch prediction entry in a first branch
prediction table (BPT) associated with the...branch is not taken.

22. The method of claim 20, further comprising the step of predicting
the subsequent IP address to be the initial IP address incremented by a
predetermined amount...
Detailed Description

this invention, a second level cache entry holds only a partial target 
address and one history bit. The predicted direction of a conditional 
branch is based simply on the direction last taken by that... 

...within a subset of the instruction address space also containing the
branch instruction. The full predicted target ...incorporating the
present invention; Fig. 2 is an overall block diagram of the branch
prediction cache (BPC) and its immediate environment;

Fig. 3 is a block diagram of the first level...

...elements within CPU 10, not such external devices. 

An Instruction Decoder (DEC) 12 performs instruction fetch, instruction
decode, and pipeline control. DEC 12 optionally interleaves instruction
prefetch of up to three simultaneous instruction streams. DEC 12
contains a two-level Branch Prediction Cache (BPC) 13. The BPC
includes an integrated structure which contains dynamic branch history
data, a physical branch target address, and a branch target buffer for
each cache entry. As branch instructions are decoded, the BPC is
consulted for information about that branch. Independent of the direction
predicted, branches are executed in a single cycle and do not cause
pipeline bubbles.

On each... in greater detail below, first level BPC 152 is a shallow but 
wide structure which caches full prediction information for a limited 
number of branch instructions. In particular, first level BPC... 

...instruction valid (TIV) bits. Second level BPC 155 is a deep but narrow 
structure which caches only partial prediction information but for a 
much larger number of branch instructions. Second level... 

...entries, each containing two bytes of partial target address information 
and one history bit. 

In parallel with instruction decoding, the instruction's decode PC is 
used to perform parallel lookups in the first and second level 
BPCs. (Since the incoming instructions have not been decoded at this
point, non-branch instructions are also checked.) In the event of a hit
on first level BPC 152, the target instruction bytes are communicated
to instruction decoder 160, the branch history...

...BPC 155 is always assumed to hit. Therefore, second level BPC 155
communicates the branch history bit to IDC 162 and the partial target
address to IFC 165 on every access...

...fetch address is offset from the target address by the number of target
instruction bytes cached for that branch. In the event of a miss, the
fetch address and the target address are the same.

Note that prediction information for a branch, i.e., a valid cache
entry associated with the instruction, is created only after a branch is
encountered at least once and continues to exist in the cache only
until replaced by a set of prediction information for another branch. As
with most caching strategies, the benefit of these cache schemes
primarily exists on a statistical basis. When the cache does not contain
an entry ...mentioned above, a second level BPC entry holds only a partial
target address and one history bit. The predicted direction of a
conditional branch is based simply on the direction last... the number of
cache entries.

Operation 

A first level cache size of 36 entries and second level cache size of 
256 entries, in combination with a factor of 16 difference in per-entry 
cost, results in second-to-first level cache ratios of eight times
the number of entries, yet still almost half the size. With this much 
larger size, even given the direct-mapped organization, the second 
level cache provides an effective backup to the first level cache. 
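
Taking the stated figures at face value, the two ratios can be checked directly. The symbols are introduced here for illustration only: N_1 and N_2 are the first and second level entry counts, and c is the per-entry cost of a second level entry, so a first level entry is assumed to cost 16c.

```latex
\[
\frac{N_2}{N_1} = \frac{256}{36} \approx 7.1,
\qquad
\frac{N_2 \cdot c}{N_1 \cdot 16c} = \frac{256}{36 \times 16} = \frac{256}{576} \approx 0.44
\]
```

Under these assumptions the entry ratio comes out nearer seven than the "eight times" quoted in the text, while the size ratio of about 0.44 agrees with "almost half the size".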

As each branch instruction is fetched, its address is used to perform
parallel look-ups in the two levels of BPC: the large-set or fully 
associative first level access using the full branch address; and the 
direct-mapped, tag-less second level using only a subset of the 
address bits for the index. 

If there is a tag match with a first level cache entry, then all of 
this entry's prediction information is read out, and the second level 
BPC is ignored. All the necessary predictions are made, effectively 
eliminating or hiding any... still be a delay before processing of target 
instructions can begin, if the branch is predicted taken. 
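
The parallel look-up and the priority of the first level can be sketched as follows. This is a simplified model, not the patented circuit: the first level is a dictionary keyed by the full branch address, the second level is a tag-less list indexed by the low address bits (the excerpt says only "a subset of the address bits", so that choice is an assumption), and the function name predict is hypothetical.

```python
L2_ENTRIES = 256  # second level size stated in the excerpt


def predict(branch_addr, first_level, second_level):
    """Return (source, prediction) for a branch address.

    first_level: dict keyed by the full branch address (tag match).
    second_level: list of L2_ENTRIES (partial_target, taken_last_time)
    tuples, indexed without any tag check.
    """
    l2_guess = second_level[branch_addr % L2_ENTRIES]  # always "hits"
    if branch_addr in first_level:
        return "first", first_level[branch_addr]       # second level ignored
    return "second", l2_guess


# Usage: one branch known to the first level, one only to the second.
first = {0x4000: ("full-target", True)}
second = [(0, False)] * L2_ENTRIES
second[0x1234 % L2_ENTRIES] = (0xBEEF, True)
print(predict(0x4000, first, second))
print(predict(0x1234, first, second))
```

Because the second level has no tags, two branches whose addresses share an index receive the same prediction, which is the cost of the narrower entries.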



